Objectives
By the end of this lab, you will:
- Load and analyze the Lightcast dataset in a Spark DataFrame.
- Create five easy and three medium-complexity visualizations using Plotly.
- Explore salary distributions, employment trends, and job postings.
- Analyze skills in relation to NAICS/SOC/ONET codes and salaries.
- Customize colors, fonts, and styles in all visualizations (default themes result in a 2.5-point deduction).
- Follow best practices for communicating data in reports.
Step 1: Load the Dataset
import pandas as pd
import plotly.express as px
import plotly.io as pio
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
# Render Plotly figures inline in VS Code
pio.renderers.default = "vscode"
# Initialize the Spark session
spark = SparkSession.builder.appName("LightcastData").getOrCreate()
# Load the CSV file into a Spark DataFrame
df = (
    spark.read.option("header", "true")
    .option("inferSchema", "true")
    .option("multiLine", "true")
    .option("escape", "\"")
    .csv("./data/lightcast_job_postings.csv")
)
# Show Schema and Sample Data
df.printSchema()
df.show(5)
root
 |-- ID: string (nullable = true)
 |-- LAST_UPDATED_DATE: string (nullable = true)
 |-- LAST_UPDATED_TIMESTAMP: timestamp (nullable = true)
 |-- DUPLICATES: integer (nullable = true)
 |-- POSTED: string (nullable = true)
 |-- EXPIRED: string (nullable = true)
 |-- DURATION: integer (nullable = true)
 |-- SOURCE_TYPES: string (nullable = true)
 |-- SOURCES: string (nullable = true)
 |-- URL: string (nullable = true)
 |-- ACTIVE_URLS: string (nullable = true)
 |-- ACTIVE_SOURCES_INFO: string (nullable = true)
 |-- TITLE_RAW: string (nullable = true)
 |-- BODY: string (nullable = true)
 |-- MODELED_EXPIRED: string (nullable = true)
 |-- MODELED_DURATION: integer (nullable = true)
 |-- COMPANY: integer (nullable = true)
 |-- COMPANY_NAME: string (nullable = true)
 |-- COMPANY_RAW: string (nullable = true)
 |-- COMPANY_IS_STAFFING: boolean (nullable = true)
 |-- EDUCATION_LEVELS: string (nullable = true)
 |-- EDUCATION_LEVELS_NAME: string (nullable = true)
 |-- MIN_EDULEVELS: integer (nullable = true)
 |-- MIN_EDULEVELS_NAME: string (nullable = true)
 |-- MAX_EDULEVELS: integer (nullable = true)
 |-- MAX_EDULEVELS_NAME: string (nullable = true)
 |-- EMPLOYMENT_TYPE: integer (nullable = true)
 |-- EMPLOYMENT_TYPE_NAME: string (nullable = true)
 |-- MIN_YEARS_EXPERIENCE: integer (nullable = true)
 |-- MAX_YEARS_EXPERIENCE: integer (nullable = true)
 |-- IS_INTERNSHIP: boolean (nullable = true)
 |-- SALARY: integer (nullable = true)
 |-- REMOTE_TYPE: integer (nullable = true)
 |-- REMOTE_TYPE_NAME: string (nullable = true)
 |-- ORIGINAL_PAY_PERIOD: string (nullable = true)
 |-- SALARY_TO: integer (nullable = true)
 |-- SALARY_FROM: integer (nullable = true)
 |-- LOCATION: string (nullable = true)
 |-- CITY: string (nullable = true)
 |-- CITY_NAME: string (nullable = true)
 |-- COUNTY: integer (nullable = true)
 |-- COUNTY_NAME: string (nullable = true)
 |-- MSA: integer (nullable = true)
 |-- MSA_NAME: string (nullable = true)
 |-- STATE: integer (nullable = true)
 |-- STATE_NAME: string (nullable = true)
 |-- COUNTY_OUTGOING: integer (nullable = true)
 |-- COUNTY_NAME_OUTGOING: string (nullable = true)
 |-- COUNTY_INCOMING: integer (nullable = true)
 |-- COUNTY_NAME_INCOMING: string (nullable = true)
 |-- MSA_OUTGOING: integer (nullable = true)
 |-- MSA_NAME_OUTGOING: string (nullable = true)
 |-- MSA_INCOMING: integer (nullable = true)
 |-- MSA_NAME_INCOMING: string (nullable = true)
 |-- NAICS2: integer (nullable = true)
 |-- NAICS2_NAME: string (nullable = true)
 |-- NAICS3: integer (nullable = true)
 |-- NAICS3_NAME: string (nullable = true)
 |-- NAICS4: integer (nullable = true)
 |-- NAICS4_NAME: string (nullable = true)
 |-- NAICS5: integer (nullable = true)
 |-- NAICS5_NAME: string (nullable = true)
 |-- NAICS6: integer (nullable = true)
 |-- NAICS6_NAME: string (nullable = true)
 |-- TITLE: string (nullable = true)
 |-- TITLE_NAME: string (nullable = true)
 |-- TITLE_CLEAN: string (nullable = true)
 |-- SKILLS: string (nullable = true)
 |-- SKILLS_NAME: string (nullable = true)
 |-- SPECIALIZED_SKILLS: string (nullable = true)
 |-- SPECIALIZED_SKILLS_NAME: string (nullable = true)
 |-- CERTIFICATIONS: string (nullable = true)
 |-- CERTIFICATIONS_NAME: string (nullable = true)
 |-- COMMON_SKILLS: string (nullable = true)
 |-- COMMON_SKILLS_NAME: string (nullable = true)
 |-- SOFTWARE_SKILLS: string (nullable = true)
 |-- SOFTWARE_SKILLS_NAME: string (nullable = true)
 |-- ONET: string (nullable = true)
 |-- ONET_NAME: string (nullable = true)
 |-- ONET_2019: string (nullable = true)
 |-- ONET_2019_NAME: string (nullable = true)
 |-- CIP6: string (nullable = true)
 |-- CIP6_NAME: string (nullable = true)
 |-- CIP4: string (nullable = true)
 |-- CIP4_NAME: string (nullable = true)
 |-- CIP2: string (nullable = true)
 |-- CIP2_NAME: string (nullable = true)
 |-- SOC_2021_2: string (nullable = true)
 |-- SOC_2021_2_NAME: string (nullable = true)
 |-- SOC_2021_3: string (nullable = true)
 |-- SOC_2021_3_NAME: string (nullable = true)
 |-- SOC_2021_4: string (nullable = true)
 |-- SOC_2021_4_NAME: string (nullable = true)
 |-- SOC_2021_5: string (nullable = true)
 |-- SOC_2021_5_NAME: string (nullable = true)
 |-- LOT_CAREER_AREA: integer (nullable = true)
 |-- LOT_CAREER_AREA_NAME: string (nullable = true)
 |-- LOT_OCCUPATION: integer (nullable = true)
 |-- LOT_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_SPECIALIZED_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_NAME: string (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP: integer (nullable = true)
 |-- LOT_V6_OCCUPATION_GROUP_NAME: string (nullable = true)
 |-- LOT_V6_CAREER_AREA: integer (nullable = true)
 |-- LOT_V6_CAREER_AREA_NAME: string (nullable = true)
 |-- SOC_2: string (nullable = true)
 |-- SOC_2_NAME: string (nullable = true)
 |-- SOC_3: string (nullable = true)
 |-- SOC_3_NAME: string (nullable = true)
 |-- SOC_4: string (nullable = true)
 |-- SOC_4_NAME: string (nullable = true)
 |-- SOC_5: string (nullable = true)
 |-- SOC_5_NAME: string (nullable = true)
 |-- LIGHTCAST_SECTORS: string (nullable = true)
 |-- LIGHTCAST_SECTORS_NAME: string (nullable = true)
 |-- NAICS_2022_2: integer (nullable = true)
 |-- NAICS_2022_2_NAME: string (nullable = true)
 |-- NAICS_2022_3: integer (nullable = true)
 |-- NAICS_2022_3_NAME: string (nullable = true)
 |-- NAICS_2022_4: integer (nullable = true)
 |-- NAICS_2022_4_NAME: string (nullable = true)
 |-- NAICS_2022_5: integer (nullable = true)
 |-- NAICS_2022_5_NAME: string (nullable = true)
 |-- NAICS_2022_6: integer (nullable = true)
 |-- NAICS_2022_6_NAME: string (nullable = true)
25/03/25 02:10:44 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
| ID|LAST_UPDATED_DATE|LAST_UPDATED_TIMESTAMP|DUPLICATES| POSTED| EXPIRED|DURATION| SOURCE_TYPES| SOURCES| URL|ACTIVE_URLS|ACTIVE_SOURCES_INFO| TITLE_RAW| BODY|MODELED_EXPIRED|MODELED_DURATION| COMPANY| COMPANY_NAME|COMPANY_RAW|COMPANY_IS_STAFFING|EDUCATION_LEVELS|EDUCATION_LEVELS_NAME|MIN_EDULEVELS| MIN_EDULEVELS_NAME|MAX_EDULEVELS|MAX_EDULEVELS_NAME|EMPLOYMENT_TYPE|EMPLOYMENT_TYPE_NAME|MIN_YEARS_EXPERIENCE|MAX_YEARS_EXPERIENCE|IS_INTERNSHIP|SALARY|REMOTE_TYPE|REMOTE_TYPE_NAME|ORIGINAL_PAY_PERIOD|SALARY_TO|SALARY_FROM| LOCATION| CITY| CITY_NAME|COUNTY| COUNTY_NAME| MSA| MSA_NAME|STATE|STATE_NAME|COUNTY_OUTGOING|COUNTY_NAME_OUTGOING|COUNTY_INCOMING|COUNTY_NAME_INCOMING|MSA_OUTGOING| MSA_NAME_OUTGOING|MSA_INCOMING| MSA_NAME_INCOMING|NAICS2| NAICS2_NAME|NAICS3| NAICS3_NAME|NAICS4| NAICS4_NAME|NAICS5| NAICS5_NAME|NAICS6| NAICS6_NAME| TITLE| TITLE_NAME| TITLE_CLEAN| SKILLS| SKILLS_NAME| SPECIALIZED_SKILLS|SPECIALIZED_SKILLS_NAME| CERTIFICATIONS| CERTIFICATIONS_NAME| COMMON_SKILLS| COMMON_SKILLS_NAME| SOFTWARE_SKILLS|SOFTWARE_SKILLS_NAME| ONET| ONET_NAME| ONET_2019| ONET_2019_NAME| CIP6| CIP6_NAME| CIP4| CIP4_NAME| CIP2| CIP2_NAME|SOC_2021_2| SOC_2021_2_NAME|SOC_2021_3| SOC_2021_3_NAME|SOC_2021_4|SOC_2021_4_NAME|SOC_2021_5|SOC_2021_5_NAME|LOT_CAREER_AREA|LOT_CAREER_AREA_NAME|LOT_OCCUPATION| LOT_OCCUPATION_NAME|LOT_SPECIALIZED_OCCUPATION|LOT_SPECIALIZED_OCCUPATION_NAME|LOT_OCCUPATION_GROUP|LOT_OCCUPATION_GROUP_NAME|LOT_V6_SPECIALIZED_OCCUPATION|LOT_V6_SPECIALIZED_OCCUPATION_NAME|LOT_V6_OCCUPATION|LOT_V6_OCCUPATION_NAME|LOT_V6_OCCUPATION_GROUP|LOT_V6_OCCUPATION_GROUP_NAME|LOT_V6_CAREER_AREA|LOT_V6_CAREER_AREA_NAME| SOC_2| SOC_2_NAME| SOC_3| SOC_3_NAME| SOC_4| SOC_4_NAME| SOC_5| SOC_5_NAME|LIGHTCAST_SECTORS|LIGHTCAST_SECTORS_NAME|NAICS_2022_2| NAICS_2022_2_NAME|NAICS_2022_3| NAICS_2022_3_NAME|NAICS_2022_4| NAICS_2022_4_NAME|NAICS_2022_5| NAICS_2022_5_NAME|NAICS_2022_6| NAICS_2022_6_NAME|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|1f57d95acf4dc67ed...| 9/6/2024| 2024-09-06 20:32:...| 0|6/2/2024| 6/8/2024| 6| [\n "Company"\n]|[\n "brassring.c...|[\n "https://sjo...| []| NULL|Enterprise Analys...|31-May-2024\n\nEn...| 6/8/2024| 6| 894731| Murphy USA| Murphy USA| false| [\n 2\n]| [\n "Bachelor's ...| 2| Bachelor's degree| NULL| NULL| 1|Full-time (> 32 h...| 2| 2| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 33.20...|RWwgRG9yYWRvLCBBUg==|El Dorado, AR| 5139| Union, AR|20980| El Dorado, AR| 5| Arkansas| 5139| Union, AR| 5139| Union, AR| 20980| El Dorado, AR| 20980| El Dorado, AR| 44| Retail Trade| 441|Motor Vehicle and...| 4413|Automotive Parts,...| 44133|Automotive Parts ...|441330|Automotive Parts ...|ET29C073C03D1F86B4|Enterprise Analysts|enterprise analys...|[\n "KS126DB6T06...|[\n "Merchandisi...|[\n "KS126DB6T06...| [\n "Merchandisi...| []| []|[\n "KS126706DPF...|[\n "Mathematics...|[\n "KS440W865GC...|[\n "SQL (Progra...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|[\n "45.0601",\n...|[\n "Economics, ...|[\n "45.06",\n ...|[\n "Economics",...|[\n "45",\n "27...|[\n "Social Scie...| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101011| General ERP Analy...| 2310| Business Intellig...| 23101011| General ERP Analy...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| [\n 7\n]| [\n "Artificial ...| 44| Retail Trade| 441|Motor Vehicle and...| 4413|Automotive Parts,...| 44133|Automotive Parts ...| 441330|Automotive Parts ...|
|0cb072af26757b6c4...| 8/2/2024| 2024-08-02 17:08:...| 0|6/2/2024| 8/1/2024| NULL| [\n "Job Board"\n]| [\n "maine.gov"\n]|[\n "https://job...| []| NULL|Oracle Consultant...|Oracle Consultant...| 8/1/2024| NULL| 133098|Smx Corporation L...| SMX| true| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 1|Full-time (> 32 h...| 3| 3| false| NULL| 1| Remote| NULL| NULL| NULL|{\n "lat": 44.31...| QXVndXN0YSwgTUU=| Augusta, ME| 23011| Kennebec, ME|12300|Augusta-Watervill...| 23| Maine| 23011| Kennebec, ME| 23011| Kennebec, ME| 12300|Augusta-Watervill...| 12300|Augusta-Watervill...| 56|Administrative an...| 561|Administrative an...| 5613| Employment Services| 56132|Temporary Help Se...|561320|Temporary Help Se...|ET21DDA63780A7DC09| Oracle Consultants|oracle consultant...|[\n "KS122626T55...|[\n "Procurement...|[\n "KS122626T55...| [\n "Procurement...| []| []| []| []|[\n "BGSBF3F508F...|[\n "Oracle Busi...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101012| Oracle Consultant...| 2310| Business Intellig...| 23101012| Oracle Consultant...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 56|Administrative an...| 561|Administrative an...| 5613| Employment Services| 56132|Temporary Help Se...| 561320|Temporary Help Se...|
|85318b12b3331fa49...| 9/6/2024| 2024-09-06 20:32:...| 1|6/2/2024| 7/7/2024| 35| [\n "Job Board"\n]|[\n "dejobs.org"\n]|[\n "https://dej...| []| NULL| Data Analyst|Taking care of pe...| 6/10/2024| 8|39063746| Sedgwick| Sedgwick| false| [\n 2\n]| [\n "Bachelor's ...| 2| Bachelor's degree| NULL| NULL| 1|Full-time (> 32 h...| 5| NULL| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 32.77...| RGFsbGFzLCBUWA==| Dallas, TX| 48113| Dallas, TX|19100|Dallas-Fort Worth...| 48| Texas| 48113| Dallas, TX| 48113| Dallas, TX| 19100|Dallas-Fort Worth...| 19100|Dallas-Fort Worth...| 52|Finance and Insur...| 524|Insurance Carrier...| 5242|Agencies, Brokera...| 52429|Other Insurance R...|524291| Claims Adjusting|ET3037E0C947A02404| Data Analysts| data analyst|[\n "KS1218W78FG...|[\n "Management"...|[\n "ESF3939CE1F...| [\n "Exception R...|[\n "KS683TN76T7...|[\n "Security Cl...|[\n "KS1218W78FG...|[\n "Management"...|[\n "KS126HY6YLT...|[\n "Microsoft O...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231113|Data / Data Minin...| 23111310| Data Analyst| 2311| Data Analysis and...| 23111310| Data Analyst| 231113| Data / Data Minin...| 2311| Data Analysis and...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 52|Finance and Insur...| 524|Insurance Carrier...| 5242|Agencies, Brokera...| 52429|Other Insurance R...| 524291| Claims Adjusting|
|1b5c3941e54a1889e...| 9/6/2024| 2024-09-06 20:32:...| 1|6/2/2024|7/20/2024| 48| [\n "Job Board"\n]|[\n "disabledper...|[\n "https://www...| []| NULL|Sr. Lead Data Mgm...|About this role:\...| 6/12/2024| 10|37615159| Wells Fargo|Wells Fargo| false| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 1|Full-time (> 32 h...| 3| NULL| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 33.44...| UGhvZW5peCwgQVo=| Phoenix, AZ| 4013| Maricopa, AZ|38060|Phoenix-Mesa-Chan...| 4| Arizona| 4013| Maricopa, AZ| 4013| Maricopa, AZ| 38060|Phoenix-Mesa-Chan...| 38060|Phoenix-Mesa-Chan...| 52|Finance and Insur...| 522|Credit Intermedia...| 5221|Depository Credit...| 52211| Commercial Banking|522110| Commercial Banking|ET2114E0404BA30075|Management Analysts|sr lead data mgmt...|[\n "KS123QX62QY...|[\n "Exit Strate...|[\n "KS123QX62QY...| [\n "Exit Strate...| []| []|[\n "KS7G6NP6R6L...|[\n "Reliability...|[\n "KS4409D76NW...|[\n "SAS (Softwa...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231113|Data / Data Minin...| 23111310| Data Analyst| 2311| Data Analysis and...| 23111310| Data Analyst| 231113| Data / Data Minin...| 2311| Data Analysis and...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| [\n 6\n]| [\n "Data Privac...| 52|Finance and Insur...| 522|Credit Intermedia...| 5221|Depository Credit...| 52211| Commercial Banking| 522110| Commercial Banking|
|cb5ca25f02bdf25c1...| 6/19/2024| 2024-06-19 07:00:00| 0|6/2/2024|6/17/2024| 15|[\n "FreeJobBoar...|[\n "craigslist....|[\n "https://mod...| []| NULL|Comisiones de $10...|Comisiones de $10...| 6/17/2024| 15| 0| Unclassified| LH/GM| false| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 3|Part-time / full-...| NULL| NULL| false| 92500| 0| [None]| year| 150000| 35000|{\n "lat": 37.63...| TW9kZXN0bywgQ0E=| Modesto, CA| 6099|Stanislaus, CA|33700| Modesto, CA| 6|California| 6099| Stanislaus, CA| 6099| Stanislaus, CA| 33700| Modesto, CA| 33700| Modesto, CA| 99|Unclassified Indu...| 999|Unclassified Indu...| 9999|Unclassified Indu...| 99999|Unclassified Indu...|999999|Unclassified Indu...|ET0000000000000000| Unclassified|comisiones de por...| []| []| []| []| []| []| []| []| []| []|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101012| Oracle Consultant...| 2310| Business Intellig...| 23101012| Oracle Consultant...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 99|Unclassified Indu...| 999|Unclassified Indu...| 9999|Unclassified Indu...| 99999|Unclassified Indu...| 999999|Unclassified Indu...|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
only showing top 5 rows
Salary Distribution by Employment Type
- Identify salary trends across different employment types.
- Filter the dataset
  - Remove records where salary is missing or zero.
- Aggregate Data
  - Group by employment type and compute salary distribution.
- Visualize results
  - Create a box plot where:
    - X-axis = EMPLOYMENT_TYPE_NAME
    - Y-axis = SALARY_FROM
  - Customize colors, fonts, and styles to avoid a 2.5-point deduction.
- Explanation: Write two sentences about what the graph reveals.
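# Keep only postings with a non-null, positive starting salary
salary_df = df.filter(col("SALARY_FROM").isNotNull() & (col("SALARY_FROM") > 0))
# Convert the two required columns to pandas for plotting
pdf = salary_df.select("EMPLOYMENT_TYPE_NAME", "SALARY_FROM").toPandas()
# Box plot of starting salary by employment type
fig = px.box(pdf, x="EMPLOYMENT_TYPE_NAME", y="SALARY_FROM", title="Salary Distribution by Employment Type", color_discrete_sequence=["#636EFA"])
fig.update_layout(font_family="Arial", title_font_size=16)
fig.show()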
This box plot shows that full-time positions generally offer higher salaries compared to part-time, temporary, and contract roles. The wide range of salaries within full-time roles suggests a diversity in job levels and responsibilities across these positions.
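A box plot draws quartiles for each group, so the "compute salary distribution" step can also be checked numerically. The sketch below is plain Python on a handful of made-up rows; the salary values and the helper name salary_summary are illustrative assumptions, not taken from the dataset. It drops missing or zero salaries and then reports the quartiles per employment type.

```python
from collections import defaultdict
from statistics import quantiles

# Hypothetical sample rows: (EMPLOYMENT_TYPE_NAME, SALARY_FROM); values are made up.
rows = [
    ("Full-time", 60000), ("Full-time", 80000), ("Full-time", 100000),
    ("Full-time", 120000), ("Part-time", 30000), ("Part-time", 40000),
    ("Part-time", 50000), ("Part-time", None), ("Full-time", 0),
]

def salary_summary(rows):
    """Return {employment type: (q1, median, q3)} after dropping missing/zero salaries."""
    groups = defaultdict(list)
    for emp_type, salary in rows:
        if salary is not None and salary > 0:
            groups[emp_type].append(salary)
    summary = {}
    for emp_type, salaries in groups.items():
        # "inclusive" treats the sample as containing its own min and max
        q1, median, q3 = quantiles(salaries, n=4, method="inclusive")
        summary[emp_type] = (q1, median, q3)
    return summary

summary = salary_summary(rows)
for emp_type, (q1, median, q3) in sorted(summary.items()):
    print(f"{emp_type}: Q1={q1:.0f}, median={median:.0f}, Q3={q3:.0f}")
```

Interpolation conventions for quartiles differ between tools, so these numbers may not match the plotted boxes exactly; the point is only to confirm that the filtering and grouping behave as intended.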
Salary Distribution by Industry
- Compare salary variations across industries.
- Filter the dataset
  - Keep records where salary is greater than zero.
- Aggregate Data
  - Group by NAICS industry codes.
- Visualize results
  - Create a box plot where:
    - X-axis = NAICS2_NAME
    - Y-axis = SALARY_FROM
  - Customize colors, fonts, and styles.
- Explanation: Write two sentences about what the graph reveals.
import plotly.express as px
import plotly.io as pio
# Use a built-in template
pio.templates.default = "plotly_white"
# Keep rows where the salary is greater than 0
industry_df = df.filter((col("SALARY_FROM").isNotNull()) & (col("SALARY_FROM") > 0))
# Convert the required columns to pandas
pdf = industry_df.select("NAICS2_NAME", "SALARY_FROM").toPandas()
# Build the box plot
fig = px.box(
    pdf,
    x="NAICS2_NAME",
    y="SALARY_FROM",
    title="Salary Distribution by Industry",
    template="plotly_white",
    color="NAICS2_NAME"
)
# Configure the font, colors, and axis labels
fig.update_layout(
    font=dict(family="Arial", size=14, color="#333"),
    title_font=dict(size=20),
    xaxis_title="Industry (NAICS2)",
    yaxis_title="Salary From",
    showlegend=False
)
fig.show()
This box plot shows how salaries vary across different industries. While some industries have tightly grouped salaries, others show a wider spread, indicating a broader range of roles and seniority levels.
Job Posting Trends Over Time
- Analyze how job postings fluctuate over time.
- Aggregate Data
  - Count job postings per posted date (POSTED).
- Visualize results
  - Create a line chart where:
    - X-axis = POSTED
    - Y-axis = Number of Job Postings
  - Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
df.select("POSTED").show(10, truncate=False)
+--------+
|POSTED  |
+--------+
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
|6/2/2024|
+--------+
only showing top 10 rows
from pyspark.sql.functions import to_date
df_dates = df.withColumn("POSTED_DATE", to_date(col("POSTED")))
df_dates.select("POSTED", "POSTED_DATE").show(10, truncate=False)
# Check how many records have a non-null date:
df_dates.filter(col("POSTED_DATE").isNotNull()).count()
+--------+-----------+
|POSTED  |POSTED_DATE|
+--------+-----------+
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
|6/2/2024|NULL       |
+--------+-----------+
only showing top 10 rows
0
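The count of 0 means the format-less parse did not recognize any POSTED value: the strings are written month/day/year without zero padding, not in an ISO-style year-month-day layout, so an explicit format is needed. The same idea can be sketched in plain Python, with no Spark required. The helper name parse_posted is hypothetical and the sample strings are illustrative, including one deliberately malformed value.

```python
from collections import Counter
from datetime import date, datetime

def parse_posted(value):
    """Parse a month/day/year string such as '6/2/2024'; return None if it does not match."""
    try:
        return datetime.strptime(value, "%m/%d/%Y").date()
    except (TypeError, ValueError):
        return None

# Illustrative values only; the last one is intentionally not a date.
posted = ["6/2/2024", "6/2/2024", "6/17/2024", "not a date"]
parsed = [parse_posted(p) for p in posted]

# Count postings per parsed date, skipping values that failed to parse.
counts = Counter(d for d in parsed if d is not None)
for day, n in sorted(counts.items()):
    print(day.isoformat(), n)
```

Values that fail to parse come back as None, so it is worth filtering them out (or at least counting them) before aggregating by date.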
spark = SparkSession.builder \
    .appName("LightcastData") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()
print(pdf.shape)
print(pdf.head())
(154, 2)
  POSTED_DATE  job_count
0        None         22
1  2024-05-01        506
2  2024-05-02        437
3  2024-05-03        679
4  2024-05-04        573
print(pdf["job_count"].describe())
count     154.000000
mean      470.766234
std       217.792964
min        22.000000
25%       288.000000
50%       493.000000
75%       626.750000
max      1050.000000
Name: job_count, dtype: float64
from pyspark.sql import SparkSession
from pyspark.sql.functions import to_date, count, col
import plotly.express as px
import plotly.io as pio
# Set the renderer for VS Code
pio.renderers.default = "vscode"
# If you have not created a SparkSession yet, use this block:
spark = SparkSession.builder \
    .appName("LightcastData") \
    .config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
    .getOrCreate()
# Convert the date from a string to a Date column
df_dates = df.withColumn("POSTED_DATE", to_date(col("POSTED"), "MM/dd/yyyy"))
# Group by date and count the number of postings
date_grouped = df_dates.groupBy("POSTED_DATE").agg(count("*").alias("job_count"))
# Convert to a pandas DataFrame
pdf = date_grouped.orderBy("POSTED_DATE").toPandas()
# Build the line chart
fig = px.line(
    pdf,
    x="POSTED_DATE",
    y="job_count",
    title="Job Posting Trends Over Time",
    template="plotly_white"
)
# Configure the appearance
fig.update_layout(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20),
    xaxis_title="Date Posted",
    yaxis_title="Number of Job Postings"
)
# Show the chart directly below the cell in VS Code
fig.show()
This line chart shows the fluctuation in job postings over time. Peaks may correspond to seasonal hiring periods, while dips may reflect holidays or economic slowdowns.
Top 10 Job Titles by Count
- Identify the most frequently posted job titles.
- Aggregate Data
  - Count the occurrences of each job title (TITLE_NAME).
  - Select the top 10 most frequent titles.
- Visualize results
  - Create a bar chart where:
    - X-axis = TITLE_NAME
    - Y-axis = Job Count
  - Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
from pyspark.sql.functions import col, count
import plotly.express as px
import plotly.io as pio
# Set the renderer so the chart displays in VS Code
pio.renderers.default = "vscode"
# Group by TITLE_NAME and count the postings
title_counts = df.groupBy("TITLE_NAME").agg(count("*").alias("job_count"))
# Select the top 10
top_titles = title_counts.orderBy(col("job_count").desc()).limit(10)
# Convert to pandas
pdf = top_titles.toPandas()
# Build the bar chart
fig = px.bar(
    pdf,
    x="TITLE_NAME",
    y="job_count",
    title="Top 10 Job Titles by Count",
    template="plotly_white",
    text="job_count"
)
# Configure the appearance
fig.update_layout(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20),
    xaxis_title="Job Title",
    yaxis_title="Number of Postings"
)
# Rotate the x-axis labels for readability
fig.update_xaxes(tickangle=45)
# Show the chart
fig.show()
This bar chart highlights the top 10 most frequently posted job titles. The most common roles likely reflect current labor market demands and widespread recruitment for essential positions.
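The count-then-rank step is easy to verify on a small sample. This is a plain-Python sketch on made-up titles, not the dataset's actual counts, and top_titles is a hypothetical helper name.

```python
from collections import Counter

def top_titles(titles, k=10):
    """Return the k most frequent non-empty titles as (title, count) pairs."""
    cleaned = [t for t in titles if t]  # skip None and empty strings
    return Counter(cleaned).most_common(k)

# Hypothetical sample of TITLE_NAME values; frequencies are made up.
titles = [
    "Data Analysts", "Data Analysts", "Enterprise Analysts",
    "Oracle Consultants", "Data Analysts", None, "Oracle Consultants",
]
ranking = top_titles(titles, k=2)
print(ranking)
```

The sample rows printed earlier include a posting whose TITLE_NAME is Unclassified, so it may be worth deciding whether such placeholder titles belong in the ranking before plotting it.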
Remote vs On-Site Job Postings
- Compare the proportion of remote and on-site job postings.
- Aggregate Data
  - Count job postings by remote type (REMOTE_TYPE_NAME).
- Visualize results
  - Create a pie chart where:
    - Labels = REMOTE_TYPE_NAME
    - Values = Job Count
  - Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
from pyspark.sql.functions import count
import plotly.express as px
import plotly.io as pio
# Configure the renderer
pio.renderers.default = "vscode"
# Count the postings for each remote type
remote_counts = df.groupBy("REMOTE_TYPE_NAME").agg(count("*").alias("job_count"))
# Convert to pandas
pdf = remote_counts.toPandas()
# Build the pie chart
fig = px.pie(
    pdf,
    names="REMOTE_TYPE_NAME",
    values="job_count",
    title="Remote vs On-Site Job Postings",
    template="plotly_white"
)
# Configure the appearance
fig.update_layout(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20)
)
fig.show()
This pie chart shows the distribution of job postings by work arrangement type. It reveals whether remote, hybrid, or on-site positions are more commonly offered in the current job market.
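The sample output in this lab shows `REMOTE_TYPE_NAME` values such as `[None]` next to `Remote`, so it may help to fold placeholder and missing labels into one category before reading the shares. The following is a minimal plain-Python sketch of that idea. The counts, the `Hybrid Remote` and `Not Remote` labels, the `Not specified` label, and the `normalize` helper are all assumptions made for illustration, not values taken from the dataset.

```python
# Hypothetical counts for illustration only; real values come from remote_counts.
raw_counts = {
    "[None]": 700,
    "Remote": 200,
    "Hybrid Remote": 60,
    "Not Remote": 30,
    None: 10,
}


def normalize(label):
    """Fold missing and placeholder labels into one readable category."""
    if label is None or label == "[None]":
        return "Not specified"
    return label


# Merge counts that map to the same cleaned label.
merged = {}
for label, n in raw_counts.items():
    key = normalize(label)
    merged[key] = merged.get(key, 0) + n

# Convert counts to percentage shares, as a pie chart would display them.
total = sum(merged.values())
shares = {label: round(100 * n / total, 1) for label, n in merged.items()}
print(shares)
```

The same relabeling could be applied to the pandas frame before it is passed to `px.pie`, so that the legend shows one clearly named slice instead of separate `[None]` and null slices.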
Skill Demand Analysis by Industry (Stacked Bar Chart)
- Identify which skills are most in demand in various industries.
- Aggregate Data
- Extract skills from job postings.
- Count occurrences of skills grouped by NAICS industry codes.
- Visualize results
- Create a stacked bar chart where:
- X-axis = Industry
- Y-axis = Skill Count
- Color = Skill
- Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
from pyspark.sql import SparkSession
# Start Spark with more driver memory and legacy date parsing enabled
spark = SparkSession.builder \
.appName("LightcastData") \
.config("spark.driver.memory", "4g") \
.config("spark.sql.legacy.timeParserPolicy", "LEGACY") \
.getOrCreate()
# Load the data
df = spark.read.option("header", "true") \
.option("inferSchema", "true") \
.option("multiLine", "true") \
.option("escape", "\"") \
.csv("./data/lightcast_job_postings.csv")
df.show(5)
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
25/03/25 03:21:26 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
25/03/25 03:21:41 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
| ID|LAST_UPDATED_DATE|LAST_UPDATED_TIMESTAMP|DUPLICATES| POSTED| EXPIRED|DURATION| SOURCE_TYPES| SOURCES| URL|ACTIVE_URLS|ACTIVE_SOURCES_INFO| TITLE_RAW| BODY|MODELED_EXPIRED|MODELED_DURATION| COMPANY| COMPANY_NAME|COMPANY_RAW|COMPANY_IS_STAFFING|EDUCATION_LEVELS|EDUCATION_LEVELS_NAME|MIN_EDULEVELS| MIN_EDULEVELS_NAME|MAX_EDULEVELS|MAX_EDULEVELS_NAME|EMPLOYMENT_TYPE|EMPLOYMENT_TYPE_NAME|MIN_YEARS_EXPERIENCE|MAX_YEARS_EXPERIENCE|IS_INTERNSHIP|SALARY|REMOTE_TYPE|REMOTE_TYPE_NAME|ORIGINAL_PAY_PERIOD|SALARY_TO|SALARY_FROM| LOCATION| CITY| CITY_NAME|COUNTY| COUNTY_NAME| MSA| MSA_NAME|STATE|STATE_NAME|COUNTY_OUTGOING|COUNTY_NAME_OUTGOING|COUNTY_INCOMING|COUNTY_NAME_INCOMING|MSA_OUTGOING| MSA_NAME_OUTGOING|MSA_INCOMING| MSA_NAME_INCOMING|NAICS2| NAICS2_NAME|NAICS3| NAICS3_NAME|NAICS4| NAICS4_NAME|NAICS5| NAICS5_NAME|NAICS6| NAICS6_NAME| TITLE| TITLE_NAME| TITLE_CLEAN| SKILLS| SKILLS_NAME| SPECIALIZED_SKILLS|SPECIALIZED_SKILLS_NAME| CERTIFICATIONS| CERTIFICATIONS_NAME| COMMON_SKILLS| COMMON_SKILLS_NAME| SOFTWARE_SKILLS|SOFTWARE_SKILLS_NAME| ONET| ONET_NAME| ONET_2019| ONET_2019_NAME| CIP6| CIP6_NAME| CIP4| CIP4_NAME| CIP2| CIP2_NAME|SOC_2021_2| SOC_2021_2_NAME|SOC_2021_3| SOC_2021_3_NAME|SOC_2021_4|SOC_2021_4_NAME|SOC_2021_5|SOC_2021_5_NAME|LOT_CAREER_AREA|LOT_CAREER_AREA_NAME|LOT_OCCUPATION| LOT_OCCUPATION_NAME|LOT_SPECIALIZED_OCCUPATION|LOT_SPECIALIZED_OCCUPATION_NAME|LOT_OCCUPATION_GROUP|LOT_OCCUPATION_GROUP_NAME|LOT_V6_SPECIALIZED_OCCUPATION|LOT_V6_SPECIALIZED_OCCUPATION_NAME|LOT_V6_OCCUPATION|LOT_V6_OCCUPATION_NAME|LOT_V6_OCCUPATION_GROUP|LOT_V6_OCCUPATION_GROUP_NAME|LOT_V6_CAREER_AREA|LOT_V6_CAREER_AREA_NAME| SOC_2| SOC_2_NAME| SOC_3| SOC_3_NAME| SOC_4| SOC_4_NAME| SOC_5| SOC_5_NAME|LIGHTCAST_SECTORS|LIGHTCAST_SECTORS_NAME|NAICS_2022_2| NAICS_2022_2_NAME|NAICS_2022_3| NAICS_2022_3_NAME|NAICS_2022_4| NAICS_2022_4_NAME|NAICS_2022_5| NAICS_2022_5_NAME|NAICS_2022_6| NAICS_2022_6_NAME|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
|1f57d95acf4dc67ed...| 9/6/2024| 2024-09-06 20:32:...| 0|6/2/2024| 6/8/2024| 6| [\n "Company"\n]|[\n "brassring.c...|[\n "https://sjo...| []| NULL|Enterprise Analys...|31-May-2024\n\nEn...| 6/8/2024| 6| 894731| Murphy USA| Murphy USA| false| [\n 2\n]| [\n "Bachelor's ...| 2| Bachelor's degree| NULL| NULL| 1|Full-time (> 32 h...| 2| 2| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 33.20...|RWwgRG9yYWRvLCBBUg==|El Dorado, AR| 5139| Union, AR|20980| El Dorado, AR| 5| Arkansas| 5139| Union, AR| 5139| Union, AR| 20980| El Dorado, AR| 20980| El Dorado, AR| 44| Retail Trade| 441|Motor Vehicle and...| 4413|Automotive Parts,...| 44133|Automotive Parts ...|441330|Automotive Parts ...|ET29C073C03D1F86B4|Enterprise Analysts|enterprise analys...|[\n "KS126DB6T06...|[\n "Merchandisi...|[\n "KS126DB6T06...| [\n "Merchandisi...| []| []|[\n "KS126706DPF...|[\n "Mathematics...|[\n "KS440W865GC...|[\n "SQL (Progra...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...|[\n "45.0601",\n...|[\n "Economics, ...|[\n "45.06",\n ...|[\n "Economics",...|[\n "45",\n "27...|[\n "Social Scie...| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101011| General ERP Analy...| 2310| Business Intellig...| 23101011| General ERP Analy...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| [\n 7\n]| [\n "Artificial ...| 44| Retail Trade| 441|Motor Vehicle and...| 4413|Automotive Parts,...| 44133|Automotive Parts ...| 441330|Automotive Parts ...|
|0cb072af26757b6c4...| 8/2/2024| 2024-08-02 17:08:...| 0|6/2/2024| 8/1/2024| NULL| [\n "Job Board"\n]| [\n "maine.gov"\n]|[\n "https://job...| []| NULL|Oracle Consultant...|Oracle Consultant...| 8/1/2024| NULL| 133098|Smx Corporation L...| SMX| true| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 1|Full-time (> 32 h...| 3| 3| false| NULL| 1| Remote| NULL| NULL| NULL|{\n "lat": 44.31...| QXVndXN0YSwgTUU=| Augusta, ME| 23011| Kennebec, ME|12300|Augusta-Watervill...| 23| Maine| 23011| Kennebec, ME| 23011| Kennebec, ME| 12300|Augusta-Watervill...| 12300|Augusta-Watervill...| 56|Administrative an...| 561|Administrative an...| 5613| Employment Services| 56132|Temporary Help Se...|561320|Temporary Help Se...|ET21DDA63780A7DC09| Oracle Consultants|oracle consultant...|[\n "KS122626T55...|[\n "Procurement...|[\n "KS122626T55...| [\n "Procurement...| []| []| []| []|[\n "BGSBF3F508F...|[\n "Oracle Busi...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101012| Oracle Consultant...| 2310| Business Intellig...| 23101012| Oracle Consultant...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 56|Administrative an...| 561|Administrative an...| 5613| Employment Services| 56132|Temporary Help Se...| 561320|Temporary Help Se...|
|85318b12b3331fa49...| 9/6/2024| 2024-09-06 20:32:...| 1|6/2/2024| 7/7/2024| 35| [\n "Job Board"\n]|[\n "dejobs.org"\n]|[\n "https://dej...| []| NULL| Data Analyst|Taking care of pe...| 6/10/2024| 8|39063746| Sedgwick| Sedgwick| false| [\n 2\n]| [\n "Bachelor's ...| 2| Bachelor's degree| NULL| NULL| 1|Full-time (> 32 h...| 5| NULL| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 32.77...| RGFsbGFzLCBUWA==| Dallas, TX| 48113| Dallas, TX|19100|Dallas-Fort Worth...| 48| Texas| 48113| Dallas, TX| 48113| Dallas, TX| 19100|Dallas-Fort Worth...| 19100|Dallas-Fort Worth...| 52|Finance and Insur...| 524|Insurance Carrier...| 5242|Agencies, Brokera...| 52429|Other Insurance R...|524291| Claims Adjusting|ET3037E0C947A02404| Data Analysts| data analyst|[\n "KS1218W78FG...|[\n "Management"...|[\n "ESF3939CE1F...| [\n "Exception R...|[\n "KS683TN76T7...|[\n "Security Cl...|[\n "KS1218W78FG...|[\n "Management"...|[\n "KS126HY6YLT...|[\n "Microsoft O...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231113|Data / Data Minin...| 23111310| Data Analyst| 2311| Data Analysis and...| 23111310| Data Analyst| 231113| Data / Data Minin...| 2311| Data Analysis and...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 52|Finance and Insur...| 524|Insurance Carrier...| 5242|Agencies, Brokera...| 52429|Other Insurance R...| 524291| Claims Adjusting|
|1b5c3941e54a1889e...| 9/6/2024| 2024-09-06 20:32:...| 1|6/2/2024|7/20/2024| 48| [\n "Job Board"\n]|[\n "disabledper...|[\n "https://www...| []| NULL|Sr. Lead Data Mgm...|About this role:\...| 6/12/2024| 10|37615159| Wells Fargo|Wells Fargo| false| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 1|Full-time (> 32 h...| 3| NULL| false| NULL| 0| [None]| NULL| NULL| NULL|{\n "lat": 33.44...| UGhvZW5peCwgQVo=| Phoenix, AZ| 4013| Maricopa, AZ|38060|Phoenix-Mesa-Chan...| 4| Arizona| 4013| Maricopa, AZ| 4013| Maricopa, AZ| 38060|Phoenix-Mesa-Chan...| 38060|Phoenix-Mesa-Chan...| 52|Finance and Insur...| 522|Credit Intermedia...| 5221|Depository Credit...| 52211| Commercial Banking|522110| Commercial Banking|ET2114E0404BA30075|Management Analysts|sr lead data mgmt...|[\n "KS123QX62QY...|[\n "Exit Strate...|[\n "KS123QX62QY...| [\n "Exit Strate...| []| []|[\n "KS7G6NP6R6L...|[\n "Reliability...|[\n "KS4409D76NW...|[\n "SAS (Softwa...|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231113|Data / Data Minin...| 23111310| Data Analyst| 2311| Data Analysis and...| 23111310| Data Analyst| 231113| Data / Data Minin...| 2311| Data Analysis and...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| [\n 6\n]| [\n "Data Privac...| 52|Finance and Insur...| 522|Credit Intermedia...| 5221|Depository Credit...| 52211| Commercial Banking| 522110| Commercial Banking|
|cb5ca25f02bdf25c1...| 6/19/2024| 2024-06-19 07:00:...| 0|6/2/2024|6/17/2024| 15|[\n "FreeJobBoar...|[\n "craigslist....|[\n "https://mod...| []| NULL|Comisiones de $10...|Comisiones de $10...| 6/17/2024| 15| 0| Unclassified| LH/GM| false| [\n 99\n]| [\n "No Educatio...| 99|No Education Listed| NULL| NULL| 3|Part-time / full-...| NULL| NULL| false| 92500| 0| [None]| year| 150000| 35000|{\n "lat": 37.63...| TW9kZXN0bywgQ0E=| Modesto, CA| 6099|Stanislaus, CA|33700| Modesto, CA| 6|California| 6099| Stanislaus, CA| 6099| Stanislaus, CA| 33700| Modesto, CA| 33700| Modesto, CA| 99|Unclassified Indu...| 999|Unclassified Indu...| 9999|Unclassified Indu...| 99999|Unclassified Indu...|999999|Unclassified Indu...|ET0000000000000000| Unclassified|comisiones de por...| []| []| []| []| []| []| []| []| []| []|15-2051.01|Business Intellig...|15-2051.01|Business Intellig...| []| []| []| []| []| []| 15-0000|Computer and Math...| 15-2000|Mathematical Scie...| 15-2050|Data Scientists| 15-2051|Data Scientists| 23|Information Techn...| 231010|Business Intellig...| 23101012| Oracle Consultant...| 2310| Business Intellig...| 23101012| Oracle Consultant...| 231010| Business Intellig...| 2310| Business Intellig...| 23| Information Techn...|15-0000|Computer and Math...|15-2000|Mathematical Scie...|15-2050|Data Scientists|15-2051|Data Scientists| NULL| NULL| 99|Unclassified Indu...| 999|Unclassified Indu...| 9999|Unclassified Indu...| 99999|Unclassified Indu...| 999999|Unclassified Indu...|
+--------------------+-----------------+----------------------+----------+--------+---------+--------+--------------------+--------------------+--------------------+-----------+-------------------+--------------------+--------------------+---------------+----------------+--------+--------------------+-----------+-------------------+----------------+---------------------+-------------+-------------------+-------------+------------------+---------------+--------------------+--------------------+--------------------+-------------+------+-----------+----------------+-------------------+---------+-----------+--------------------+--------------------+-------------+------+--------------+-----+--------------------+-----+----------+---------------+--------------------+---------------+--------------------+------------+--------------------+------------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------+--------------------+------------------+-------------------+--------------------+--------------------+--------------------+--------------------+-----------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+----------+--------------------+----------+--------------------+----------+---------------+----------+---------------+---------------+--------------------+--------------+--------------------+--------------------------+-------------------------------+--------------------+-------------------------+-----------------------------+----------------------------------+-----------------+----------------------+-----------------------+----------------------------+------------------+-----------------------+-------+---------------
-----+-------+--------------------+-------+---------------+-------+---------------+-----------------+----------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+------------+--------------------+
only showing top 5 rows
df.columns
['ID', 'LAST_UPDATED_DATE', 'LAST_UPDATED_TIMESTAMP', 'DUPLICATES', 'POSTED', 'EXPIRED', 'DURATION', 'SOURCE_TYPES', 'SOURCES', 'URL', 'ACTIVE_URLS', 'ACTIVE_SOURCES_INFO', 'TITLE_RAW', 'BODY', 'MODELED_EXPIRED', 'MODELED_DURATION', 'COMPANY', 'COMPANY_NAME', 'COMPANY_RAW', 'COMPANY_IS_STAFFING', 'EDUCATION_LEVELS', 'EDUCATION_LEVELS_NAME', 'MIN_EDULEVELS', 'MIN_EDULEVELS_NAME', 'MAX_EDULEVELS', 'MAX_EDULEVELS_NAME', 'EMPLOYMENT_TYPE', 'EMPLOYMENT_TYPE_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE', 'IS_INTERNSHIP', 'SALARY', 'REMOTE_TYPE', 'REMOTE_TYPE_NAME', 'ORIGINAL_PAY_PERIOD', 'SALARY_TO', 'SALARY_FROM', 'LOCATION', 'CITY', 'CITY_NAME', 'COUNTY', 'COUNTY_NAME', 'MSA', 'MSA_NAME', 'STATE', 'STATE_NAME', 'COUNTY_OUTGOING', 'COUNTY_NAME_OUTGOING', 'COUNTY_INCOMING', 'COUNTY_NAME_INCOMING', 'MSA_OUTGOING', 'MSA_NAME_OUTGOING', 'MSA_INCOMING', 'MSA_NAME_INCOMING', 'NAICS2', 'NAICS2_NAME', 'NAICS3', 'NAICS3_NAME', 'NAICS4', 'NAICS4_NAME', 'NAICS5', 'NAICS5_NAME', 'NAICS6', 'NAICS6_NAME', 'TITLE', 'TITLE_NAME', 'TITLE_CLEAN', 'SKILLS', 'SKILLS_NAME', 'SPECIALIZED_SKILLS', 'SPECIALIZED_SKILLS_NAME', 'CERTIFICATIONS', 'CERTIFICATIONS_NAME', 'COMMON_SKILLS', 'COMMON_SKILLS_NAME', 'SOFTWARE_SKILLS', 'SOFTWARE_SKILLS_NAME', 'ONET', 'ONET_NAME', 'ONET_2019', 'ONET_2019_NAME', 'CIP6', 'CIP6_NAME', 'CIP4', 'CIP4_NAME', 'CIP2', 'CIP2_NAME', 'SOC_2021_2', 'SOC_2021_2_NAME', 'SOC_2021_3', 'SOC_2021_3_NAME', 'SOC_2021_4', 'SOC_2021_4_NAME', 'SOC_2021_5', 'SOC_2021_5_NAME', 'LOT_CAREER_AREA', 'LOT_CAREER_AREA_NAME', 'LOT_OCCUPATION', 'LOT_OCCUPATION_NAME', 'LOT_SPECIALIZED_OCCUPATION', 'LOT_SPECIALIZED_OCCUPATION_NAME', 'LOT_OCCUPATION_GROUP', 'LOT_OCCUPATION_GROUP_NAME', 'LOT_V6_SPECIALIZED_OCCUPATION', 'LOT_V6_SPECIALIZED_OCCUPATION_NAME', 'LOT_V6_OCCUPATION', 'LOT_V6_OCCUPATION_NAME', 'LOT_V6_OCCUPATION_GROUP', 'LOT_V6_OCCUPATION_GROUP_NAME', 'LOT_V6_CAREER_AREA', 'LOT_V6_CAREER_AREA_NAME', 'SOC_2', 'SOC_2_NAME', 'SOC_3', 'SOC_3_NAME', 'SOC_4', 'SOC_4_NAME', 'SOC_5', 
'SOC_5_NAME', 'LIGHTCAST_SECTORS', 'LIGHTCAST_SECTORS_NAME', 'NAICS_2022_2', 'NAICS_2022_2_NAME', 'NAICS_2022_3', 'NAICS_2022_3_NAME', 'NAICS_2022_4', 'NAICS_2022_4_NAME', 'NAICS_2022_5', 'NAICS_2022_5_NAME', 'NAICS_2022_6', 'NAICS_2022_6_NAME']
df.select("NAICS2_NAME", "SPECIALIZED_SKILLS_NAME").show(5, truncate=False)
+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |NAICS2_NAME |SPECIALIZED_SKILLS_NAME | +------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |Retail Trade |[\n "Merchandising",\n "Predictive Modeling",\n "Data Modeling",\n "Advanced 
Analytics",\n "Data Extraction",\n "Statistical Analysis",\n "Data Mining",\n "Business Analysis",\n "Finance",\n "Algorithms",\n "Statistics",\n "SQL (Programming Language)",\n "Ad Hoc Reporting",\n "Power BI",\n "Economics"\n] | |Administrative and Support and Waste Management and Remediation Services|[\n "Procurement",\n "Financial Statements",\n "Oracle Business Intelligence (BI) / OBIA",\n "Oracle E-Business Suite",\n "PL/SQL",\n "Supply Chain",\n "Business Intelligence",\n "Oracle Fusion Middleware",\n "Project Accounting"\n] | |Finance and Insurance |[\n "Exception Reporting",\n "Data Analysis",\n "Data Integrity"\n] | |Finance and Insurance |[\n "Exit Strategies",\n "User Story",\n "Hardware Configuration Management",\n "On Prem",\n "Agile Methodology",\n "Solution Design",\n "Advanced Analytics",\n "Reengineering",\n "Cross-Functional Collaboration",\n "Requirements Elicitation",\n "Business Analysis",\n "Data Management",\n "Data Architecture",\n "Market Trend",\n "Business Valuation",\n "Systems Development Life Cycle",\n "Test Planning",\n "Multi-Tenant Cloud Environments",\n "Scrum (Software Development)",\n "Project Management",\n "Data Migration",\n "Regulatory Compliance",\n "Product Roadmaps",\n "SAS (Software)",\n "Software As A Service (SaaS)",\n "Data Domain",\n "Product Requirements",\n "Data Governance",\n "Competitive Intelligence",\n "Operations Architecture",\n "Risk Appetite",\n "Google Cloud Platform (GCP)",\n "User Feedback"\n]| |Unclassified Industry |[] | 
+------------------------------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ only showing top 5 rows
df.select("NAICS2_NAME", "SKILLS_NAME").show(5, truncate=False)
+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |NAICS2_NAME |SKILLS_NAME | 
+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ |Retail Trade |[\n "Merchandising",\n "Mathematics",\n "Presentations",\n "Predictive Modeling",\n "Data Modeling",\n "Advanced Analytics",\n "Data Extraction",\n "Statistical Analysis",\n "Data Mining",\n "Business Analysis",\n "Finance",\n "Algorithms",\n "Statistics",\n "SQL (Programming Language)",\n "Report Writing",\n "Ad Hoc Reporting",\n "Power BI",\n "Relationship Building",\n "Economics",\n "Business Administration"\n] | |Administrative and Support and Waste Management and Remediation Services|[\n "Procurement",\n "Financial Statements",\n "Oracle Business Intelligence (BI) / OBIA",\n "Oracle E-Business Suite",\n "PL/SQL",\n "Supply Chain",\n "Business Intelligence",\n "Oracle Fusion Middleware",\n "Project Accounting"\n] | |Finance and Insurance |[\n "Management",\n "Exception Reporting",\n 
"Report Writing",\n "Security Clearance",\n "Interpersonal Communications",\n "Ability To Meet Deadlines",\n "Presentations",\n "Writing",\n "Data Analysis",\n "Organizational Skills",\n "Negotiation",\n "Data Integrity",\n "Microsoft Office"\n] | |Finance and Insurance |[\n "Exit Strategies",\n "Reliability",\n "User Story",\n "Management",\n "Strategic Planning",\n "Hardware Configuration Management",\n "On Prem",\n "Agile Methodology",\n "Solution Design",\n "Advanced Analytics",\n "Reengineering",\n "Safety Assurance",\n "Cross-Functional Collaboration",\n "Requirements Elicitation",\n "Business Analysis",\n "Data Management",\n "Data Architecture",\n "Influencing Skills",\n "Market Trend",\n "Business Valuation",\n "Creativity",\n "Innovation",\n "Governance",\n "Systems Development Life Cycle",\n "Leadership",\n "Test Planning",\n "Multi-Tenant Cloud Environments",\n "Scrum (Software Development)",\n "Project Management",\n "Operations",\n "Data Migration",\n "Regulatory Compliance",\n "Product Roadmaps",\n "SAS (Software)",\n "Troubleshooting (Problem Solving)",\n "Quality Assurance",\n "Software As A Service (SaaS)",\n "Data Domain",\n "Product Requirements",\n "Data Governance",\n "Competitive Intelligence",\n "Operations Architecture",\n "Risk Appetite",\n "Google Cloud Platform (GCP)",\n "User Feedback"\n]| |Unclassified Industry |[] | 
+------------------------------------------------------------------------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+ only showing top 5 rows
print(pdf.shape)
pdf.head()
(1000, 3)
| | NAICS2_NAME | SPECIALIZED_SKILLS_NAME | skill_count |
|---|---|---|---|
| 0 | Finance and Insurance | [\n "Workflow Management",\n "Agile Methodol... | 4 |
| 1 | Administrative and Support and Waste Managemen... | [\n "Help Desk Support"\n] | 2 |
| 2 | Professional, Scientific, and Technical Services | [\n "Agile Methodology",\n "Erwin (Data Mode... | 4 |
| 3 | Administrative and Support and Waste Managemen... | [\n "Policy Analysis",\n "Futures Exchange",... | 1 |
| 4 | Finance and Insurance | [\n "Milestones (Project Management)",\n "Wo... | 56 |
rows = filtered_skills.limit(1000).collect()
import pandas as pd
pdf = pd.DataFrame(rows, columns=["NAICS2_NAME", "SKILL", "skill_count"])
from pyspark.sql.functions import split, explode, col, count, regexp_replace
import plotly.express as px
import plotly.io as pio
import pandas as pd
# Configure the renderer
pio.renderers.default = "notebook" # "vscode", "iframe", or "browser" are alternatives
# 1. Split the string into an array on commas (allowing for whitespace after each comma)
df_cleaned = df.withColumn("SKILL_LIST", split(col("SKILLS_NAME"), ",\\s*"))
# 2. Explode the array into one row per skill
exploded_df = df_cleaned.select("NAICS2_NAME", explode(col("SKILL_LIST")).alias("SKILL"))
# 3. Strip leftover brackets and quotes, then leading/trailing whitespace
#    (substr(2, 100) dropped only the first character and left trailing quotes)
exploded_df = exploded_df.withColumn(
    "SKILL",
    regexp_replace(regexp_replace(col("SKILL"), r'[\[\]"]', ""), r"^\s+|\s+$", "")
)
# 4. Count skill frequency per industry
skill_counts = exploded_df.groupBy("NAICS2_NAME", "SKILL") \
    .agg(count("*").alias("skill_count"))
# 5. Top 5 industries by number of postings
top_industries = df.groupBy("NAICS2_NAME") \
    .count() \
    .orderBy(col("count").desc()) \
    .limit(5)
top_industries_list = [row["NAICS2_NAME"] for row in top_industries.collect()]
# 6. Keep only skills from those industries
filtered_skills = skill_counts.filter(col("NAICS2_NAME").isin(top_industries_list))
# 7. Convert to pandas manually; sort first so the 1,000 rows kept are the most frequent pairs
rows = filtered_skills.orderBy(col("skill_count").desc()).limit(1000).collect()
pdf = pd.DataFrame(rows, columns=["NAICS2_NAME", "SKILL", "skill_count"])
# 8. Visualization
fig = px.bar(
    pdf,
    x="NAICS2_NAME",
    y="skill_count",
    color="SKILL",
    title="Top Skills by Industry (from SKILLS_NAME as string)",
    template="plotly_white"
)
fig.update_layout(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20),
    xaxis_title="Industry",
    yaxis_title="Skill Count"
)
fig.show()
This stacked bar chart displays the most common skills extracted from the SKILLS_NAME string field across the top five industries. The preprocessing step split stringified skill lists into individual skills, revealing which specific competencies are in highest demand by sector.
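The splitting and counting step can be illustrated on a tiny sample without Spark. The sketch below assumes each skill value is a JSON-style array string, as the sample output above suggests; the sample rows and the helper name `count_skills` are hypothetical and not part of the lab code.

```python
import json
from collections import Counter

# Hypothetical sample rows: (industry, stringified skill list)
sample_rows = [
    ("Finance and Insurance", '[\n  "Workflow Management",\n  "Agile Methodology"\n]'),
    ("Finance and Insurance", '[\n  "Agile Methodology"\n]'),
    ("Administrative and Support", '[\n  "Help Desk Support"\n]'),
    ("Administrative and Support", None),  # missing skill list
]

def count_skills(rows):
    """Count (industry, skill) pairs from JSON-style skill strings."""
    counts = Counter()
    for industry, raw in rows:
        if not raw:
            continue  # skip missing or empty values
        try:
            skills = json.loads(raw)
        except json.JSONDecodeError:
            continue  # assumption: rows that do not parse are dropped
        for skill in skills:
            counts[(industry, skill.strip())] += 1
    return counts

skill_counts = count_skills(sample_rows)
for (industry, skill), n in skill_counts.most_common():
    print(f"{industry}: {skill} = {n}")
```

Parsing the value as JSON, where the data allows it, avoids leftover brackets and quotes and keeps skill names that contain commas intact, which a plain comma split cannot guarantee.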
Salary Analysis by ONET Occupation Type (Bubble Chart)¶
- Analyze how salaries differ across ONET occupation types.
- Aggregate Data
  - Compute median salary for each occupation in the ONET taxonomy.
- Visualize results
  - Create a bubble chart where:
    - X-axis = ONET_NAME
    - Y-axis = Median Salary
    - Size = Number of job postings
  - Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
df.select("ONET_NAME").distinct().show(10, truncate=False)
+------------------------------+
|ONET_NAME                     |
+------------------------------+
|Business Intelligence Analysts|
|NULL                          |
+------------------------------+
df.filter(col("ONET_NAME").isNotNull()).count()
72454
from pyspark.sql.functions import col, count, expr
import pandas as pd
import plotly.express as px
import plotly.io as pio
pio.renderers.default = "notebook"
# Note: filter ONLY on ONET_NAME (zero salaries are not excluded up front)
filtered_df = df.filter(col("ONET_NAME").isNotNull())
# Compute the median salary and the number of postings per occupation
median_salary_df = filtered_df.groupBy("ONET_NAME") \
    .agg(
        expr("percentile_approx(SALARY_FROM, 0.5)").alias("median_salary"),
        count("*").alias("job_postings")
    )
# Drop occupations with no salary data
median_salary_df = median_salary_df.filter(col("median_salary").isNotNull())
# Sort by number of postings and keep the top 100 occupations
rows = median_salary_df.orderBy(col("job_postings").desc()).limit(100).collect()
# Convert to pandas
pdf = pd.DataFrame(rows, columns=["ONET_NAME", "median_salary", "job_postings"])
# Visualization
fig = px.scatter(
    pdf,
    x="ONET_NAME",
    y="median_salary",
    size="job_postings",
    title="Median Salary by ONET Occupation Type",
    template="plotly_white"
)
fig.update_layout(
    font=dict(family="Arial", size=14),
    title_font=dict(size=20),
    xaxis_title="Occupation (ONET)",
    yaxis_title="Median Salary",
    xaxis_tickangle=45,
    height=600
)
fig.show()
Because ONET_NAME has only one non-null value in this dataset (Business Intelligence Analysts), the bubble chart contains a single bubble placed at that occupation's median SALARY_FROM. The bubble's size reflects the number of postings for the role, so the chart summarizes demand and pay for this one occupation and cannot compare occupations.
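The aggregation behind the chart (a median salary and a posting count per occupation) can be sketched in plain Python. The posting values and the helper name `summarize` are hypothetical. Note that `statistics.median` is exact and averages the two middle values for even-sized groups, whereas Spark's `percentile_approx` is approximate, so results may differ slightly on real data.

```python
from collections import defaultdict
from statistics import median

# Hypothetical postings: (occupation, salary_from); None marks a missing salary
postings = [
    ("Business Intelligence Analysts", 90000),
    ("Business Intelligence Analysts", 110000),
    ("Business Intelligence Analysts", None),
    ("Business Intelligence Analysts", 100000),
    (None, 75000),  # posting without an occupation label
]

def summarize(rows):
    """Return {occupation: (median_salary, posting_count)}."""
    salaries = defaultdict(list)
    totals = defaultdict(int)
    for occupation, salary in rows:
        if occupation is None:
            continue  # mirrors the isNotNull() filter on ONET_NAME
        totals[occupation] += 1  # counts every posting, like count("*")
        if salary is not None:
            salaries[occupation].append(salary)  # nulls are ignored by the median
    # Occupations with no salary at all are dropped
    return {occ: (median(vals), totals[occ]) for occ, vals in salaries.items() if vals}

summary = summarize(postings)
for occupation, (med, n) in summary.items():
    print(f"{occupation}: median={med}, postings={n}")
```

The posting count includes rows with a missing salary, so bubble size measures demand for the occupation rather than the number of postings that disclose pay.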
Career Pathway Trends (Sankey Diagram)¶
- Visualize job transitions between different occupation levels.
- Aggregate Data
  - Identify career transitions between SOC job classifications.
- Visualize results
  - Create a Sankey diagram where:
    - Source = SOC_2021_2_NAME
    - Target = SOC_2021_3_NAME
    - Value = Number of transitions
  - Apply custom colors and font styles.
- Explanation: Write two sentences about what the graph reveals.
from pyspark.sql.functions import col, count
import plotly.graph_objects as go
import pandas as pd
# 1. Keep rows where both SOC level names are present
pathways_df = df.filter(
    col("SOC_2021_2_NAME").isNotNull() & col("SOC_2021_3_NAME").isNotNull()
)
# 2. Group: number of postings for each SOC-2 → SOC-3 pair
grouped = pathways_df.groupBy("SOC_2021_2_NAME", "SOC_2021_3_NAME") \
    .agg(count("*").alias("count"))
# 3. Collect the data
rows = grouped.collect()
pdf = pd.DataFrame(rows, columns=["source", "target", "value"])
# 4. Build the list of all unique nodes
all_nodes = list(pd.unique(pdf[["source", "target"]].values.ravel("K")))
# 5. Convert source/target labels to node indices
pdf["source_id"] = pdf["source"].apply(lambda x: all_nodes.index(x))
pdf["target_id"] = pdf["target"].apply(lambda x: all_nodes.index(x))
# 6. Build the Sankey diagram
fig = go.Figure(data=[go.Sankey(
    node=dict(
        pad=15,
        thickness=20,
        line=dict(color="black", width=0.5),
        label=all_nodes,
        color="lightblue"
    ),
    link=dict(
        source=pdf["source_id"],
        target=pdf["target_id"],
        value=pdf["value"],
        color="rgba(236, 116, 36, 0.4)" # custom link color
    )
)])
fig.update_layout(
    title_text="Career Pathways: SOC-2021 Level 2 to Level 3",
    font=dict(family="Arial", size=14),
    title_font=dict(size=20)
)
fig.show()
This Sankey diagram shows how postings in each broad SOC level 2 group divide among the more specific SOC level 3 groups. Link width is the number of postings in each level 2/level 3 pair, so the widest links mark the specializations with the most postings; the links reflect the classification hierarchy rather than observed job-to-job moves.
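A Sankey trace takes integer node indices rather than labels, which is why the code above converts names with `all_nodes.index`. As a minimal sketch of the same conversion, the version below uses a dictionary lookup instead of repeated list searches; the triples and the helper name `build_sankey_inputs` are hypothetical, and indices are assigned in order of first appearance, which may differ from the node order produced above.

```python
# Hypothetical (SOC-2 name, SOC-3 name, posting count) triples
pairs = [
    ("Computer and Mathematical Occupations", "Mathematical Science Occupations", 120),
    ("Computer and Mathematical Occupations", "Computer Occupations", 45),
    ("Management Occupations", "Operations Specialties Managers", 30),
]

def build_sankey_inputs(rows):
    """Map labels to integer node ids and return (labels, sources, targets, values)."""
    node_index = {}

    def index_of(label):
        # Assign the next free index the first time a label is seen
        return node_index.setdefault(label, len(node_index))

    sources, targets, values = [], [], []
    for source, target, value in rows:
        sources.append(index_of(source))
        targets.append(index_of(target))
        values.append(value)

    # Dicts preserve insertion order, so keys line up with their indices
    labels = list(node_index)
    return labels, sources, targets, values

labels, sources, targets, values = build_sankey_inputs(pairs)
print(labels)
print(sources, targets, values)
```

The four lists correspond to the `label`, `source`, `target`, and `value` arguments of the Sankey trace. A label that appears at both levels would map to a single node under this scheme, as it does with `all_nodes.index`.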